UseR 2025 - Duke
Duke University
parsermd implements a C++ parser and abstract syntax tree (AST) for Quarto and R Markdown documents in R.
supports manipulating ASTs (filtering, editing, etc.)
node classes use S7 for validation and dispatch
ability to directly source and render ASTs
off-and-on project since COVID; the original use case was to aid in grading for a large machine learning course
v0.1.3 is on CRAN, v0.2.0 with full Quarto support is on GitHub (CRAN soon)
Quarto examples today, but everything works with R Markdown
hello.qmd
---
title: "Hello, Quarto"
format:
  html:
    self-contained: true
---
```{r}
#| label: load-packages
#| include: false
library(tidyverse)
library(palmerpenguins)
```
## Meet Quarto
Quarto enables you to weave together content and executable code into a finished document.
To learn more about Quarto see <https://quarto.org>.
## Meet the penguins
{style="float:right;" fig-alt="Illustration of three species of Palmer Archipelago penguins: Chinstrap, Gentoo, and Adelie. Artwork by @allison_horst." width="401"}
The `penguins` data from the [**palmerpenguins**](https://allisonhorst.github.io/palmerpenguins "palmerpenguins R package")
package contains size measurements for `{r} nrow(penguins)` penguins from three species
observed on three islands in the Palmer Archipelago, Antarctica.
The plot below shows the relationship between flipper and bill lengths of these penguins.
```{r}
#| label: plot-penguins
#| warning: false
#| echo: false
ggplot(penguins,
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```
## Other Quarto features
### Fenced divs
:::{.callout-note}
Note that there are five types of callouts, including:
`note`, `tip`, `warning`, `caution`, and `important`.
:::
### Markdown code blocks
Some sample python code,
```python
import numpy as np
import matplotlib.pyplot as plt
r = np.arange(0, 2, 0.01)
theta = 2 * np.pi * r
fig, ax = plt.subplots(
  subplot_kw = {'projection': 'polar'}
)
ax.plot(theta, r)
ax.set_rticks([0.5, 1, 1.5, 2])
ax.grid(True)
plt.show()
```
### Short codes
Shortcodes are special markdown directives that generate various types of content,
{{< lipsum 1 >}}
├── YAML [2 fields]
├── Markdown [1 line]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
├── Markdown [1 line]
├── Heading [h2] - Meet the penguins
├── Markdown [7 lines]
├── Chunk [r, 12 lines] - plot-penguins
├── Heading [h2] - Other Quarto features
├── Heading [h3] - Fenced divs
├── Open Fenced div [.callout-note]
├── Markdown [2 lines]
├── Close Fenced div
├── Heading [h3] - Markdown code blocks
├── Markdown [1 line]
├── Code block [python, 12 lines]
├── Heading [h3] - Short codes
└── Markdown [3 lines]
├── YAML [2 fields]
├── Markdown [1 line]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
│ └── Markdown [1 line]
├── Heading [h2] - Meet the penguins
│ ├── Markdown [7 lines]
│ └── Chunk [r, 12 lines] - plot-penguins
└── Heading [h2] - Other Quarto features
  ├── Heading [h3] - Fenced divs
  │ ├── Open Fenced div [.callout-note]
  │ │ └── Markdown [2 lines]
  │ └── Close Fenced div
  ├── Heading [h3] - Markdown code blocks
  │ ├── Markdown [1 line]
  │ └── Code block [python, 12 lines]
  └── Heading [h3] - Short codes
    └── Markdown [3 lines]
Assuming a hierarchy lets us use a CSS-selector-like approach to target specific nodes based on headings and their descendants.
as_document()
ASTs and nodes can be converted back to Quarto documents:
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---
## Meet Quarto
Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see <https://quarto.org>.
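The round trip above might look roughly like the following. This is a minimal sketch, not the code from the talk: it assumes `parse_rmd()` from the CRAN release treats a multi-element character vector as document contents, and that `by_section()` matches on the heading text. The short inline document is a made-up stand-in for hello.qmd.

```r
library(parsermd)

# Made-up stand-in for hello.qmd so the sketch is self-contained
qmd <- c(
  "---",
  "title: \"Hello, Quarto\"",
  "---",
  "",
  "## Meet Quarto",
  "",
  "Quarto enables you to weave together content and executable code.",
  "",
  "## Meet the penguins",
  "",
  "The penguins data contains size measurements."
)

# Parse the text into an AST
# (assumption: a character vector of lines is treated as the document body)
rmd <- parse_rmd(qmd)

# Keep only the "Meet Quarto" section, selector style
meet <- rmd_select(rmd, by_section("Meet Quarto"))

# Convert the selected nodes back to document text and print it
doc <- as_document(meet)
cat(doc, sep = "\n")
```

Whether the YAML header is carried along with the selection may depend on the package version, so the sketch does not rely on it.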
rmd_select() and its helpers are built using tidyselect (multiple selectors are OR'd, i.e. unioned together)
by_section()
has_type()
has_label()
has_heading()
has_option()
has_shortcode()
by_fdiv()
├── YAML [2 fields]
├── Chunk [r, 4 lines] - load-packages
├── Heading [h2] - Meet Quarto
├── Heading [h2] - Meet the penguins
│ └── Chunk [r, 12 lines] - plot-penguins
└── Heading [h2] - Other Quarto features
  ├── Heading [h3] - Fenced divs
  │ ├── Open Fenced div [.callout-note]
  │ └── Close Fenced div
  ├── Heading [h3] - Markdown code blocks
  │ └── Code block [python, 12 lines]
  └── Heading [h3] - Short codes
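As a rough illustration of how the helpers listed above combine, here is a small sketch. It is not the selection that produced the tree above. It assumes the node type name `"rmd_chunk"` and glob matching in `has_label()`, both of which come from the CRAN release and may differ in the GitHub version. The inline document and its labels are invented for the example.

```r
library(parsermd)

# Build the chunk fence programmatically to avoid nesting literal fences
fence <- strrep("`", 3)

# Invented two-section document with one chunk per section
qmd <- c(
  "## Setup",
  "",
  "Some prose.",
  "",
  paste0(fence, "{r load-packages}"),
  "library(stats)",
  fence,
  "",
  "## Plot",
  "",
  "More prose.",
  "",
  paste0(fence, "{r plot-penguins}"),
  "plot(1:10)",
  fence
)
rmd <- parse_rmd(qmd)

# A single helper: every executable chunk (assumed type name "rmd_chunk")
chunks <- rmd_select(rmd, has_type("rmd_chunk"))

# Two helpers are unioned: the whole "Setup" section
# plus any chunk whose label matches the glob "plot-*"
picked <- rmd_select(rmd, by_section("Setup"), has_label("plot-*"))

out <- as_document(picked)
cat(out, sep = "\n")
```

Because the selectors are unioned, the prose under "Plot" is dropped while its chunk is kept.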
ASTs can also be directly rendered
rmd_modify() is a recent addition that allows for modifying ASTs in place. Its arguments are a node-modifying function followed by one or more rmd_select() helper functions.
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---
```{r}
#| label: load-packages
library(tidyverse)
library(palmerpenguins)
```
```{r}
#| label: plot-penguins
#| warning: false
#| echo: false
ggplot(penguins,
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```
---
title: Hello, Quarto
format:
  html:
    self-contained: true
---
```{r}
#| label: load-packages
#| echo: true
#| message: false
library(tidyverse)
library(palmerpenguins)
```
```{r}
#| label: plot-penguins
#| warning: false
#| echo: true
#| message: false
ggplot(penguins,
       aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(aes(color = species, shape = species)) +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(
    title = "Flipper and bill length",
    subtitle = "Dimensions for penguins at Palmer Station LTER",
    x = "Flipper length (mm)", y = "Bill length (mm)",
    color = "Penguin species", shape = "Penguin species"
  ) +
  theme_minimal()
```
One file to rule them all
I distribute assignments as GitHub repos that typically contain a README.md and an hw1.qmd file.
I inevitably end up having to maintain both hw1/ and hw1-key/ versions of the assignment.
Different repos for different audiences: students vs TAs respectively
Repos have a tendency to drift over time
Single repo with student scaffolding and solution code is ideal for maintenance but clunky for actual work
hw1.qmd
---
title: "Homework 3 - Data Analysis with R"
author: "Your Name"
date: "Due: Friday, March 15, 2024"
format: html
execute:
  warning: false
  message: false
---
## Setup
Load the required packages for this assignment:
```{r setup}
library(tidyverse)
library(palmerpenguins)
```
## Exercise 1: Basic Data Exploration
Examine the `penguins` dataset from the `palmerpenguins` package. Your task is to create a summary of the dataset that shows the number of observations and variables, and identify any missing values.
```{r ex1-student}
# Write your code here to:
# 1. Display the dimensions of the penguins dataset
# 2. Show the structure of the dataset
# 3. Count missing values in each column
```
```{r ex1-key}
# Solution: Basic data exploration
# 1. Display dimensions
cat("Dataset dimensions:", dim(penguins), "\n")
cat("Rows:", nrow(penguins), "Columns:", ncol(penguins), "\n\n")
# 2. Show structure
str(penguins)
# 3. Count missing values
cat("\nMissing values by column:\n")
penguins %>%
  summarise(across(everything(), ~ sum(is.na(.))))
```
## Exercise 2: Data Visualization
Create a scatter plot showing the relationship between flipper length and body mass for penguins. Color the points by species and add appropriate labels and a title.
```{r ex2-student}
# Create a scatter plot with:
# - x-axis: flipper_length_mm
# - y-axis: body_mass_g
# - color by species
# - add appropriate labels and title
ggplot(data = penguins, aes(x = ___, y = ___)) +
  geom_point(aes(color = ___)) +
  labs(
    title = "___",
    x = "___",
    y = "___"
  )
```
```{r ex2-key}
# Solution: Scatter plot of flipper length vs body mass
ggplot(data = penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species), alpha = 0.8, size = 2) +
  labs(
    title = "Penguin Flipper Length vs Body Mass by Species",
    x = "Flipper Length (mm)",
    y = "Body Mass (g)",
    color = "Species"
  ) +
  theme_minimal() +
  scale_color_viridis_d()
```
## Exercise 3: Statistical Analysis
Calculate summary statistics for bill length by species. Create a table showing the mean, median, standard deviation, and count for each species.
```{r ex3-student}
# Calculate summary statistics for bill_length_mm by species
# Include: mean, median, standard deviation, and count
# Remove missing values before calculating
penguins %>%
  # Add your code here
```
```{r ex3-key}
# Solution: Summary statistics for bill length by species
penguins %>%
  filter(!is.na(bill_length_mm)) %>%
  group_by(species) %>%
  summarise(
    count = n(),
    mean_bill_length = round(mean(bill_length_mm), 2),
    median_bill_length = round(median(bill_length_mm), 2),
    sd_bill_length = round(sd(bill_length_mm), 2),
    .groups = "drop"
  ) %>%
  arrange(desc(mean_bill_length))
```
## Exercise 4: Advanced Data Manipulation
Filter the dataset to include only penguins with complete data (no missing values), then create a new variable called `bill_ratio` that represents the ratio of bill length to bill depth. Finally, identify which species has the highest average bill ratio.
```{r ex4-student}
# Step 1: Filter for complete cases
# Step 2: Create bill_ratio variable (bill_length_mm / bill_depth_mm)
# Step 3: Calculate average bill_ratio by species
# Step 4: Identify species with highest average ratio
```
```{r ex4-key}
# Solution: Advanced data manipulation
complete_penguins = penguins %>%
  # Remove rows with any missing values
  filter(complete.cases(.)) %>%
  # Create bill_ratio variable
  mutate(bill_ratio = bill_length_mm / bill_depth_mm)
# Calculate average bill ratio by species
bill_ratio_summary = complete_penguins %>%
  group_by(species) %>%
  summarise(
    avg_bill_ratio = round(mean(bill_ratio), 3),
    n = n(),
    .groups = "drop"
  ) %>%
  arrange(desc(avg_bill_ratio))
print(bill_ratio_summary)
# Identify species with highest average bill ratio
highest_ratio_species = bill_ratio_summary %>%
  slice_max(avg_bill_ratio, n = 1) %>%
  pull(species)
cat("\nSpecies with highest average bill ratio:", as.character(highest_ratio_species))
```
## Bonus Exercise: Conditional Logic
Write a function that categorizes penguins as "small", "medium", or "large" based on their body mass. Use the following criteria:
- Small: body mass < 3500g
- Medium: body mass between 3500g and 4500g
- Large: body mass > 4500g
Apply this function to create a new column and create a summary table.
```{r bonus-student}
# Create a function to categorize penguins by size
categorize_size = function(mass) {
  # Add your conditional logic here
}
# Apply the function and create summary
```
```{r bonus-key}
# Solution: Conditional logic for size categorization
categorize_size = function(mass) {
  case_when(
    is.na(mass) ~ "Unknown",
    mass < 3500 ~ "Small",
    mass >= 3500 & mass <= 4500 ~ "Medium",
    mass > 4500 ~ "Large"
  )
}
# Apply the function and create summary
penguins_with_size = penguins %>%
  mutate(size_category = categorize_size(body_mass_g))
# Create summary table
size_summary = penguins_with_size %>%
  count(species, size_category) %>%
  pivot_wider(names_from = size_category, values_from = n, values_fill = 0)
print(size_summary)
# Overall size distribution
penguins_with_size %>%
  count(size_category) %>%
  mutate(percentage = round(n / sum(n) * 100, 1))
```
├── YAML [5 fields]
├── Heading [h2] - Setup
│ ├── Markdown [1 line]
│ └── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│ ├── Markdown [1 line]
│ ├── Chunk [r, 5 lines] - ex1-student
│ └── Chunk [r, 12 lines] - ex1-key
├── Heading [h2] - Exercise 2: Data Visualization
│ ├── Markdown [1 line]
│ ├── Chunk [r, 13 lines] - ex2-student
│ └── Chunk [r, 11 lines] - ex2-key
├── Heading [h2] - Exercise 3: Statistical Analysis
│ ├── Markdown [1 line]
│ ├── Chunk [r, 7 lines] - ex3-student
│ └── Chunk [r, 12 lines] - ex3-key
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│ ├── Markdown [1 line]
│ ├── Chunk [r, 5 lines] - ex4-student
│ └── Chunk [r, 25 lines] - ex4-key
└── Heading [h2] - Bonus Exercise: Conditional Logic
  ├── Markdown [6 lines]
  ├── Chunk [r, 7 lines] - bonus-student
  └── Chunk [r, 25 lines] - bonus-key
├── YAML [5 fields]
├── Heading [h2] - Setup
│ ├── Markdown [1 line]
│ └── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│ ├── Markdown [1 line]
│ └── Chunk [r, 5 lines] - ex1
├── Heading [h2] - Exercise 2: Data Visualization
│ ├── Markdown [1 line]
│ └── Chunk [r, 13 lines] - ex2
├── Heading [h2] - Exercise 3: Statistical Analysis
│ ├── Markdown [1 line]
│ └── Chunk [r, 7 lines] - ex3
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│ ├── Markdown [1 line]
│ └── Chunk [r, 5 lines] - ex4
└── Heading [h2] - Bonus Exercise: Conditional Logic
  ├── Markdown [6 lines]
  └── Chunk [r, 7 lines] - bonus
├── YAML [5 fields]
├── Chunk [r, 2 lines] - setup
├── Heading [h2] - Exercise 1: Basic Data Exploration
│ └── Chunk [r, 12 lines] - ex1-key
├── Heading [h2] - Exercise 2: Data Visualization
│ └── Chunk [r, 11 lines] - ex2-key
├── Heading [h2] - Exercise 3: Statistical Analysis
│ └── Chunk [r, 12 lines] - ex3-key
├── Heading [h2] - Exercise 4: Advanced Data Manipulation
│ └── Chunk [r, 25 lines] - ex4-key
└── Heading [h2] - Bonus Exercise: Conditional Logic
  └── Chunk [r, 25 lines] - bonus-key
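The student and key views above can be approximated by selecting on the `-student` and `-key` label suffixes. The sketch below is one hypothetical way to do that with the selection helpers, not necessarily how the trees above were produced. It assumes the CRAN-era type names (`"rmd_heading"`, `"rmd_markdown"`) and glob support in `has_label()`, uses an invented one-exercise document, and leaves out the label renaming (`ex1-student` to `ex1`) seen in the student tree.

```r
library(parsermd)

# Build the chunk fence programmatically to avoid nesting literal fences
fence <- strrep("`", 3)

# Invented single-exercise assignment with paired student/key chunks
hw <- c(
  "## Exercise 1",
  "",
  "Describe the data.",
  "",
  paste0(fence, "{r ex1-student}"),
  "# Write your code here",
  fence,
  "",
  paste0(fence, "{r ex1-key}"),
  "dim(mtcars)",
  fence
)
rmd <- parse_rmd(hw)

# Student handout: headings, prose, and the scaffold chunks
student <- rmd_select(
  rmd,
  has_type(c("rmd_heading", "rmd_markdown")),
  has_label("*-student")
)

# Answer key: headings and the solution chunks only
key <- rmd_select(rmd, has_type("rmd_heading"), has_label("*-key"))

# Each view can be turned back into a document to write to its own repo
student_doc <- as_document(student)
key_doc <- as_document(key)
```

Keeping both chunk variants in one source file and deriving each audience's copy this way is what removes the drift between the hw1/ and hw1-key/ repos.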
The current version will be going up on CRAN soon (revdep checks still need work, other minor polishing)
Building out and documenting interesting use cases
Building out tools using this infrastructure
Improved ergonomics